[build] add Github Cache workflow and cancel-on-failure guard in bazel.yml #17575

Open
titusfortner wants to merge 2 commits into trunk from c/vigilant-feynman-72977b

@titusfortner
Member

Background

We've had frequent CI failures recently from being unable to download assets, with causes ranging from 502 Bad Gateway errors to certificate issues.
One fix is to improve our cache so the CI RBE workflow can use it instead of always downloading everything.
The cache has been disabled there because it is quite large and GitHub limits us to 10GB.

I've already wired up a bunch of things in trunk in order to verify that everything works; this PR is the final step and serves as much to document the previous work as to change anything.

The traditional repository cache holds the raw artifact downloads. Bazel 7 added --repo_contents_cache,
which also stores extracted repos (somewhat duplicating what we have in setup-bazel with external cache).
Extracted repos are 2x the size of just the downloads and are not as useful on CI, where each one is used only once.

So I disabled --repo_contents_cache on CI in 5f7df0d (and 9045e3b).
This made the repository cache small enough to enable on RBE (8f3b261).
We're also deleting all the CodeQL caches that keep adding up, since GitHub can manage these for us (312e586 & 6f5004c).

The other issue we've had is that the Windows and macOS repository caches are generated by whichever language job happens to run first after the last cache was evicted.
This PR uses a Github Cache workflow to run bazel build --nobuild, which generates the Bazel repository cache for macOS and Windows.
So now all jobs will pull the repository cache and be able to use anything inside it, and only one job per OS will save the cache.
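
A minimal sketch of what a cache-warming job along these lines could look like. It is written as a standalone job for readability, not as the call into the reusable bazel.yml that this PR actually uses. The job name, the runner labels, the action version pins, and the two example targets are assumptions for illustration; the `--nobuild` flag, the empty `--repo_contents_cache=` value, and the `cache-save` input come from the description above.

```yaml
# Hypothetical standalone cache-warming workflow; names and pins are illustrative.
name: Github Cache
on: workflow_dispatch
jobs:
  warm-repository-cache:
    strategy:
      fail-fast: false            # let each OS populate its cache independently
      matrix:
        os: [macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Version pin is an assumption; use a release that supports cache-save.
      - uses: bazel-contrib/setup-bazel@0.15.0
        with:
          repository-cache: true  # cache the raw artifact downloads
          cache-save: true        # this job is the one allowed to save
      - name: Fetch dependencies without building
        # --nobuild stops after loading and analysis, so only downloads happen.
        # An empty --repo_contents_cache= disables the extracted-repo cache.
        run: bazel build --nobuild --repo_contents_cache= //java/test/... //rb/spec/...
```

Other jobs would then restore the same cache with `cache-save: false`, so only this job per OS writes to it.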

The RBE job generates the repository cache every time regardless of which tests run, so it might as well save what it has when it is done.
The Github Cache jobs will run on trunk whenever a file likely to change a download is modified, or once a day.
The setup-bazel action saves the cache unless the job has been cancelled, and for these jobs we don't want broken builds to overwrite a good cache,
so I've added a cancel-on-failure guard in bazel.yml to prevent poisoned cache saves.
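
The guard described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the step as committed: the job layout, the `bazel test //...` command, and the version pins are hypothetical, while `gh run cancel ${{ github.run_id }}`, the `cache-save` input, and the `actions: write` permission are taken from this PR's description and reviews.

```yaml
# Sketch of a reusable workflow with a cancel-on-failure guard (illustrative).
on:
  workflow_call:
    inputs:
      cache-save:
        type: boolean
        default: false
jobs:
  bazel:
    runs-on: ubuntu-latest
    permissions:
      actions: write              # required for `gh run cancel`; the caller must grant it too
      contents: read
    steps:
      - uses: actions/checkout@v4
      # Version pin is an assumption; use a release that supports cache-save.
      - uses: bazel-contrib/setup-bazel@0.15.0
        with:
          repository-cache: true
          cache-save: ${{ inputs.cache-save }}
      - name: Run Bazel
        run: bazel test //...      # placeholder command
      - name: Cancel run so a failed build cannot save its cache
        # Only runs when an earlier step failed and this job would save the cache.
        if: failure() && inputs.cache-save
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: gh run cancel ${{ github.run_id }}
```

Note that `gh run cancel` targets the whole workflow run, not just the failing job, so sibling jobs in the same run are cancelled as well.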

Additional Considerations

Ideally we would re-enable repo_contents_cache and disable the external-cache in setup-bazel, but right now the combined size of the Windows/macOS/Linux repo caches would exceed 10GB.
There are a few ways we could improve that, but this work should address the current primary concern.

🤖 AI assistance

  • AI assisted
    • Tool(s): Claude
    • What was generated:
    • I reviewed all AI output and can explain the change

@selenium-ci selenium-ci added the B-build Includes scripting, bazel and CI integrations label May 27, 2026
@qodo-code-review
Contributor

Review Summary by Qodo

Add Github Cache workflow and cancel-on-failure guard for CI

✨ Enhancement


Walkthroughs

Description
• Add Github Cache workflow to pre-generate Bazel repository cache for macOS and Windows
• Enable cache-save in gh-cache workflow with targeted build targets
• Add cancel-on-failure guard to prevent poisoned cache saves in bazel.yml
• Update workflow triggers to run on dependency file changes or daily schedule
• Grant actions write permission for workflow cancellation capability
Diagram
flowchart LR
  A["Dependency Changes<br/>or Daily Schedule"] -->|Trigger| B["Github Cache Workflow"]
  B -->|Pre-generate Cache| C["macOS & Windows<br/>Repository Cache"]
  D["Bazel Test Job"] -->|Use Cache| C
  D -->|Failure Detected| E["Cancel Run Guard"]
  E -->|Prevent Poisoned Cache| F["Skip Cache Save"]


File Changes

1. .github/workflows/bazel.yml Error handling +6/-1

Add cancel-on-failure guard for cache poisoning

• Remove cache-version: 2 configuration line
• Add cancel-on-failure step that cancels workflow run if bazel fails and cache-save is enabled
• Step uses GitHub CLI to cancel run and requires actions write permission

.github/workflows/bazel.yml


2. .github/workflows/ci-rbe.yml ⚙️ Configuration changes +1/-0

Grant actions write permission for cancellation

• Add actions: write permission to allow workflow cancellation capability

.github/workflows/ci-rbe.yml


3. .github/workflows/gh-cache.yml ✨ Enhancement +21/-8

Enable Github Cache workflow with scheduled triggers

• Rename workflow from "CI Cache" to "Github Cache"
• Add push trigger on trunk with paths filter for dependency files (MODULE.bazel, Cargo.lock,
 maven_install.json, etc.)
• Add daily schedule trigger at 6:30 AM UTC
• Change concurrency group from ci-cache to gh-cache
• Add actions: write permission for workflow operations
• Enable cache-save: true to persist generated cache
• Update build targets to test-only paths (//java/test/..., //py:*, //rb/spec/..., etc.)
• Disable repo_contents_cache extraction to reduce cache size

.github/workflows/gh-cache.yml



@qodo-code-review
Contributor

qodo-code-review Bot commented May 27, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Action required

1. Cache triggers miss lockfiles 🐞 Bug ☼ Reliability ⭐ New
Description
The gh-cache workflow’s push path filter omits pnpm-lock.yaml and multitool.lock.json, so updates to
these Bazel dependency inputs won’t trigger cache population for macOS/Windows. Because MODULE.bazel
uses these files to generate external repositories, CI may still need to download new artifacts (and
hit the same transient 502/cert failures) until the next scheduled run.
Code

.github/workflows/gh-cache.yml[R4-15]

Evidence
The workflow only triggers on the listed paths, but MODULE.bazel explicitly uses pnpm-lock.yaml and
multitool.lock.json as inputs to external dependency generation; excluding them prevents cache
repopulation when those dependency definitions change.

.github/workflows/gh-cache.yml[4-15]
MODULE.bazel[63-103]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`.github/workflows/gh-cache.yml` uses a `push.paths` filter to decide when to repopulate the GitHub cache, but it does not include key dependency inputs (`pnpm-lock.yaml`, `multitool.lock.json`) that are used by Bazel to generate external repositories. This means macOS/Windows cache population will not run when those files change, leaving caches stale.

### Issue Context
`MODULE.bazel` references both `//:multitool.lock.json` (rules_multitool hub) and `//:pnpm-lock.yaml` (npm_translate_lock). These files directly affect what external assets Bazel will download.

### Fix Focus Areas
- .github/workflows/gh-cache.yml[4-15]

### Proposed fix
Add the missing files to the `on.push.paths` list (at minimum `pnpm-lock.yaml` and `multitool.lock.json`). Consider also including other npm-related inputs referenced by `npm_translate_lock` (e.g. `package.json`, `pnpm-workspace.yaml`, `.npmrc`, and relevant `javascript/**/package.json`) if you want cache population to run immediately when those change.
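
As a rough illustration of the minimal version of this fix, the trigger block could look like the sketch below. The existing path entries are copied from the reviewed snippet, the two added entries are the files named above, and the cron expression is an assumption derived from the stated 6:30 AM UTC daily schedule.

```yaml
# Sketch of the gh-cache.yml triggers with the two missing lockfiles added.
on:
  push:
    branches: [trunk]
    paths:
      - 'MODULE.bazel'
      - 'rust/Cargo.lock'
      - 'java/maven_install.json'
      - 'py/requirements_lock.txt'
      - 'rb/Gemfile.lock'
      - 'dotnet/paket.lock'
      - 'common/repositories.bzl'
      - 'common/browsers.bzl'
      - 'pnpm-lock.yaml'          # added: input to npm_translate_lock
      - 'multitool.lock.json'     # added: input to the rules_multitool hub
  schedule:
    - cron: '30 6 * * *'          # assumed form of the daily 06:30 UTC run
```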



2. Failure becomes cancelled run 🐞 Bug ☼ Reliability
Description
The reusable .github/workflows/bazel.yml cancels the entire workflow run when Bazel fails
(`gh run cancel ${{ github.run_id }}`) if `inputs.cache-save` is true, which can turn real failures into an
overall “cancelled” conclusion and disrupt failure-based CI signaling/branch protections. In matrix
callers like gh-cache (macOS/Windows with fail-fast: false), a single OS failure can also cancel
the sibling job, preventing it from completing and saving its cache despite the intended independent
cache population.
Code

.github/workflows/bazel.yml[R299-304]

Evidence
The reusable workflow .github/workflows/bazel.yml contains logic that invokes gh run cancel with
github.run_id when the Bazel step fails while cache-save is enabled, which necessarily cancels
the entire workflow run rather than only the failing job. Separately,
.github/workflows/gh-cache.yml runs a macOS/Windows matrix with fail-fast: false and enables
cache-save: true for each leg, making that cancellation path reachable in either OS job; as a
result, a Bazel failure in one leg can terminate the whole run and thereby stop the other matrix job
from completing and saving its cache.

.github/workflows/bazel.yml[299-304]
.github/workflows/gh-cache.yml[27-42]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A Bazel failure currently triggers `gh run cancel ${{ github.run_id }}` in the reusable Bazel workflow when `inputs.cache-save` is enabled, which cancels the entire workflow run. This can mask genuine build failures as cancellations, disrupt failure-based CI signaling/branch protections, and in matrix callers (e.g., gh-cache with `fail-fast: false`) it can also cancel sibling OS jobs and prevent them from completing and saving their caches.

## Issue Context
The run-cancellation step lives in the reusable workflow, so it impacts every workflow that calls it with `cache-save: true`. In `gh-cache.yml`, each macOS/Windows matrix leg sets `cache-save: true`, so a failure in one OS leg can trigger run-wide cancellation and terminate the other leg even though `fail-fast: false` is intended to allow them to proceed independently.

## Fix Focus Areas
- .github/workflows/bazel.yml[299-304]
- .github/workflows/gh-cache.yml[27-42]




Previous review results

Review updated until commit 21a1dae

Results up to commit b79faaf


🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Failure becomes cancelled run 🐞 Bug ☼ Reliability



Contributor

Copilot AI left a comment


Pull request overview

Adds CI support to proactively populate and safely persist the Bazel repository cache on GitHub Actions, reducing failures caused by transient upstream download issues and avoiding cache “poisoning”.

Changes:

  • Introduces a scheduled/paths-triggered gh-cache workflow that runs bazel build --nobuild on macOS and Windows and saves the repository cache.
  • Grants actions: write permission to workflows that need to save/cache or cancel runs.
  • Adds a cancel-on-Bazel-failure step in the reusable bazel.yml workflow to prevent saving a bad cache.

Note: after updating files in this repo, run (or have CI run) ./go format before merging to avoid formatter-related CI failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
.github/workflows/gh-cache.yml New/updated workflow to warm and save Bazel repository cache for macOS and Windows on trunk/schedule.
.github/workflows/ci-rbe.yml Grants actions: write permission so cache-save/cancel behavior can work on trunk.
.github/workflows/bazel.yml Removes cache-version and adds a failure-triggered run-cancel guard intended to prevent poisoned cache saves.

Comment thread .github/workflows/gh-cache.yml Outdated
Comment thread .github/workflows/bazel.yml Outdated
Comment thread .github/workflows/bazel.yml Outdated
@titusfortner titusfortner force-pushed the c/vigilant-feynman-72977b branch from b79faaf to 21a1dae Compare May 27, 2026 12:25
@qodo-code-review
Contributor

qodo-code-review Bot commented May 27, 2026

Code review by qodo was updated up to the latest commit 21a1dae

Comment on lines +4 to +15
push:
  branches: [trunk]
  paths:
    - 'MODULE.bazel'
    - 'rust/Cargo.lock'
    - 'java/maven_install.json'
    - 'py/requirements_lock.txt'
    - 'rb/Gemfile.lock'
    - 'dotnet/paket.lock'
    - 'common/repositories.bzl'
    - 'common/browsers.bzl'
schedule:
Contributor


Action required

1. Cache triggers miss lockfiles 🐞 Bug ☼ Reliability

The gh-cache workflow’s push path filter omits pnpm-lock.yaml and multitool.lock.json, so updates to
these Bazel dependency inputs won’t trigger cache population for macOS/Windows. Because MODULE.bazel
uses these files to generate external repositories, CI may still need to download new artifacts (and
hit the same transient 502/cert failures) until the next scheduled run.


Labels

B-build Includes scripting, bazel and CI integrations

3 participants